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^ The proposed smooth blockwise iterative thresholding estimator (SBITE) 

^ is a model selection technique defined as a fixed point reached by it- 

^ erating a hkelihood gradient-based thresholding function. The smooth 

li^j James-Stein thresholding function has two regularization parameters 

^ A and i/, and a smoothness parameter s. It enjoys smoothness like 

^ ridge regression and selects variables like lasso. Focusing on Gaussian 

^ regression, we show that SBITE is uniquely defined, and that its Stein 

unbiased risk estimate is a smooth function of A and i/, for better selec- 

^ tion of the two regularization parameters. We perform a Monte-Carlo 

O simulation to investigate the predictive and oracle properties of this 

^ smooth version of adaptive lasso. 

The motivation is a gravitational wave burst detection problem from 
several concomitant time series. A nonparametric wavelet-based esti- 
^ mator is developed to combine information from all captors by block- 

5^ thresholding multiresolution coefficients. We study how the smoothness 

parameter s tempers the erraticity of the risk estimate, and derive a 
universal threshold, an information criterion and an oracle inequality 
in this canonical setting. 
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1 Introduction 



Assuming a simple Gaussian model Y ~ N(q;, /) with ck G IR , the James and 



Stein| (|1961j) estimator ct^^ = cct^^^ with c = 1 - (P - 2)/||ck^^^||2 proved that 
the maximum likelihood estimate a^L^ is not admissible when P > 2, since James 
Stein's mean squared error is smaller for all coefficients. This gave birth to a class 
of shrinkage or thresholding estimators of the form 



a. 



/ - MLE\ 



'11 



where A controls the regularization. Shrinkage means that II All < llcK^^^II, and 
thresholding means that entries of ck are set to zero to achieve variable selec- 
tion. When applied coordinatewise to o;^^^, thresholding sets some entries of 
a to zero, and when applied blockwise, then all entries are set to zero at once. 
The original James-Stein estimator neither shrink nor threshold, but its truncated 
version a.'^^^ = c+ck^^^ does both blockwise by taking the positive part (i.e., 
c+ = max(c, 0)) of the multiplicative factor. Waveshrink (Donoho and Johnstone 



1994) is a famous example of coordinatewise thresholding for wavelet smoothing. 



Consider now generalized linear models (Nelder and Wedderburn, 1972) with 
observed response and P corresponding covariates Xn = {xn^i, . . . ,Xn,p) orga- 
nized in a matrix X for n = 1,...,N, negative log-likelihood —I (including a 
possible link function) and coefficients ex.. In the following, covariates have been 

mean-centered and E-rescaled into a matrix X with c orresponding coefficients f3, 

" MLE I ~\ T 

so that (3 is homoscedastic in the rescaled basis (Sardy, 2008). In that gen- 



eral regression setting, another class of regularization defines the estimate as a 
minimizer to a penalized likelihood function. 



/3 = argmin-/(X/3;y) + All/311, 



(2) 



where || ■ || is a norm or a semi-norm, and A is the regularization parameter. 
Famous examples of such methods are ridge regression (Hoerl and Kennard, 1970), 



nonparametric smoothing splines (Wahba, 1990[ ), waveshrink, nonnegative garrote 
(Breiman, 1995) or lasso (Tibshirani, 1996[ ). The last three are variable selection 
estimators. There exist exact links between ([T| and ([2]), for example waveshrink. 

One goal of this paper is to achieve variable selection with a new class of es- 
timators called smooth blockwise iterative thresholding estimators (SBITE), that 
iteratively apply a new thresholding function called smooth James-Stein, which is 
governed by a thresholding parameter A, a shrinkage parameter u and a smooth- 
ness parameter s. We will see this class of estimators can minimize penalized 
likelihood functions ^ by iteratively applying smooth James-Stein thresholding 
for (A, u, s) set to specific values. In that sense iterative thresholding encompasses 
existing variable selection methods of the type ([T]) and ^ as particular cases. 
Iterative thresholding goes beyond these methods by adding flexibility, stability. 
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smoothness and uniqueness properties. Iterative thresholding can be done coor- 
dinatewise or blockwise. As far as flexibility is concerned, recent estimators are 
governed by several regularization parameters, for example, bridge (Fu, 1998), 



SCAD (Fan and Li, 2001a), the penalized least squares estimator of Antoniadis 



and Fan (2001), EBayesThresh (Johnstone and Silverman, 2004, 2005), fused lasso 



(Tibshirani et al. 2005), elastic net (Zou and Hastie 2005), adaptive lasso (Zou 



2006) and regularization (Sardy, 2009). It is often beheved that, although these 



estimators are more flexible, their risk may be worse in some situations because 
selection of more than one regularization parameter is unstable. A smooth estima- 
tion of the risk with SBITE will allow a more stable selection of the regularization 
parameters. SBITE also owes its stability (Breiman, 1996) to its smoothness like 
ridge regression. As far as uniqueness and smoothness are concerned, we will see 
that smooth James-Stein iterative thresholding smoothly and uniquely extends 
lasso, which otherwise is not smooth and not necessarily uniquely defined. 

This paper is organized as follows. Section [2] presents our original motivation, 
the detection of gravitational wave bursts using information from several simulta- 
neously recorded time series. It motivates the need for a wavelet-based smoother 
that thresholds blocks of multiresolution coefficients across captors. Section [3] 
presents the new SBIT estimator for generalized linear regression, discusses its 
links to existing estimators, presents uniqueness and smoothness properties, and 
derives its Stein unbiased risk estimate. A Monte-Carlo experiment investigates 
its finite sample properties. Section |4] focuses on block canonical regression, where 
we study tempering of the erratic behavior of the Stein unbiased risk estimate by 
means of the smoothness coefficient of SBITE, and derive a universal threshold, an 
information criterion and an oracle inequality. Finally the estimator is applied to 
gravitational wave bursts and simulated data. Section |5] discusses two extensions. 



2 Motivation 

Gravitational wave bursts are produced by energetic cosmic phenomena such as 
the collapse of a supernova. They are rare, highly oscillating and of small inten- 
sity compared to the instrumental noise, so only the concomitant recording by Q 
captors (typically Q = 3) of the cosmic phenomena may help prove the existence 
of such wave bursts. The captors are located far apart from each other to avoid 
recording local earth phenomena (such as an earthquake) on all Q captors. An- 
other difficulty is the non-white nature of the noise, possibly non-Gaussian. The 
measurements are recorded at a high frequency of 5 KHz: one minute of recording 
has 3Q ■ 10^ noisy measurements. A good model for these data is 

^(^)=/x('')(t)+et\ t = l,...,T, g=l...,g (3) 

where the noises e^'^'' and e^'^'^ are independent between captors q ^ q', but where 
the noise is temporally correlated for a given captor. Importantly, most of the 
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time the underlying signal lJi^'^\t) = for all q, and, if jJ^^^^t) ^ for a given time 
t and captor q, then the same is true for all the other captors. Finally because the 
incoming wave burst may not hit the captors with the same angle, we may not 
have fi^' ^^t) = fi^'^ ^(t), but only a proportio nality constant relates them. 



Like Klimenko and Mitselmakher (2004), we assume each underlying signal fi^'^^ 
expands on T = N orthonormal approximation (p and fine scale ip wavelets: 



2J0-1 



7V,-1 



(4) 



A:=0 



j=jo n=0 



where the wavelets are obtained by dilation j and translation n, J 



see 



and Nj 
mal regression matrix W 



l0g2(iV) 



Donoho and Johnstone (1994). One can extract an orthonor- 



becomes o 



[$o^jo • • • "^j] from this representation such that ([S]) 



Applying the orthonormal wavelet decomposition 

W^, the model can also be written as 5^^^ = (3^"'^ + e^'^), where 5^'') = W^S^'^'^ and 
£(9) = W"^£^'^\ This latter model is interesting in three aspects. First Johnstone 
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Figure 1: EDA from 26 seconds recording: (left) estimated spectrum and (right) 
boxplots of estimated wavelet variances on a log-scale for level 4 to 11. 



and Silverman (1997) show that stationary correlated noise £q is well decorrelated 



by a wavelet transform within and between levels. 



roukh et al. , 2000) that can be estimated from the data (Donoho and 





Percival , 


1995 


1. ""-^ 
Ser- 


^Donoho and 


ohnstone 



1995). The right plot of Figure [T] represents boxplots of such wavelet variances 
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estimated from 31 disjoint signals of length = 4096 for j = 4, . . . , 11. The left 
plot of Figure [T] shows an estimated spectrum from 26 seconds of recording: the 
noise of the captor is a band-pass filtered colored noise process. Second, the data 
are Gaussianized by the linear wavelet transformation. Third, owing to sparse 
wavelet representation, the vectors f3^'^^ are sparse, and the amount of sparsity 
varies between levels, so the selection of hyperparameters should be level depen- 
dent. Hence organizing the coefficients of captor q by levels, model ^ and Q can 
be well approximated at a given level j by 

Yf = (xf + zf with Yf = Sf/af and (xf=f3^f/af, (5) 

where a^'^^ = (a^-^i, • • • , (^^\n^) is a sparse vector and z^-^-* N(0, 1). Importantly, 

given a dilation j and a translation n, then (Xj^n = {c(\\L ■ ■ ■ , ttj> ) is either null or, 
when CKjn ^ 0, then its entries are different. Our goal is to derive an estimator 
that adapts levelwise to block sparsity. 



3 Smooth blockwise iterative thresholding 
3.1 Review of block coordinate relaxation 

Recent estimators consider the situation where the coefficients are blocked into J 
groups /3 = {f3i, . . . ,f3j) of respective sizes pi, . . . ,pj with Y.j=iPj = P- Corre- 
spondingly, let X = [Xi . . . Xj] . For instance, for gravitational wave burst detec- 
tion, wavelet coefficients are grouped into blocks of size Q, the number of captors, 
and Xj is the Q x Q identity matrix for all j = 1, . . . , J. 

We recall an optimization technique upon which we elaborate a new estimator 

" MLE 

in the following section. Suppose for now we want to calculate (3 solution to 
^ for A = 0. Block coordinate relaxation (BCR) works as follows: start with any 
initial guess /3, choose a block j G {1, . . . , J} and update only the jth block jSj 
conditional on the values of the other blocks /3j for all i j, that is 

/3f = arg min -/(J] X,f3, + X,f3-, y), (6) 

and leave the other unchanged to obtain the next iterate 

/3(^) = (/3i,...,/3,._i, /Jf^'' ,/3,.+i,...,/3^). (7) 

Note that j is the index of the updated block, but not of the iteration. 

Property 1 : Assuming that f3 E B, where B = Bi x ... x Bj is a. product 
of closed convex sets, that ^ has a unique solution and that the log-likelihood 
is continuously differentiable, then the algorithm converges to a stationary point 



(Bertsekas, 1999, Proposition 2.7.1). If the negative log-likelihood is also strictly 
convex, then the algorithm finds the MLE. 

Property 2: After updating f3 with f3^^^ according to pn, the gradient of the 



likelihood with respect to the jth block is null, that is Vi3.l{Xf3^^^; y) = 0. 
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3.2 Smooth blockwise iterative thresholding estimator 

The MLE does not achieve variable selection however. To do so, one can test 
the significance of the jth block based on the likelihood and its gradient in the 
following way. Suppose we are at the MLE where the entire gradient vector is 
null. The covariates have been S-rescaled for all MLE coefficients to have unit 
variance as discussed in Section [T} So if after thresholding the jth block of the 
MLE to zero the gradient with respect to the jth block remains small (compared to 
a threshold A), then we declare this block not significant. This suggests calculating 
the likelihood's block gradient at each BCR iteration: 



• at the current iterate Z?'--''-* defined by According to Property 2, the 
gradient with respect to the jth block is null; 

• at the current iterate with f3^j^^^ thresholded to 0, namely at 

/3(^-)-° = (/3i,...,/3^._i, ,/3,.+i,...,/3,). (8) 

The gradient with respect to the jth block is V/37(X/3*-"''*^°; y). 

A difference larger than a threshold A between the two likelihood's block gradient 
norms, that is, ||V^J(X/3(^)^°;y)|| > A, shows that the jth block is significant 
given the value of the other blocks. This leads to the following estimator. 

Smooth block iterative thresholding estimator (SBITE) (algorithmic definition). 
Choose a threshold A > 0, a shrinkage parameter u > 1 and a smoothness param- 
eter s > 1. Let /3 be a root-A^-consistent estimate of (3. 

1. Start with any initial value; 

2. Choose a block j, and calculate f3^^^^ according to (6) and the gradient 
V^i(X/3(^-)-°;y); 

3. Update the jth block according to 



^update ^ _ ^ )!./3lf '^''; (9) 

||/3;r-i||V;3i(x/3(^^^°;y)ii ' 



4. Go back to step 2 until convergence. 

The thresholding function (9) is called smooth James-Stein: when and 
II V/3 7(X/3*-''''^°; y)|| are small, thresholding sets the jth block to zero. We study 
the advantage of the smoothness parameter s later. For s = 1 and certain values of 
z/, SBITE is linked to existing estimators, as established in the following property. 

Property 3 (Gaussian case with a smoothness parameter s = 1): The SBITE 
iterations converge at the limit to the estimate of: 
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1. lasso (Tibshirani 



1996) for s = u = 1 and blocks of size one. Lasso is not an 



oracle procedure (Fan and Li, 2001b) 



2. 
3. 



group lasso (Bakin 1999 Yuan and Lin 2006) for s 



V 



1, with blocks. 



adaptive lasso, which is oracle (Zou, 2006), for s = 1 and v > \ and blocks 



of size one. Adaptive lasso is the motivation for including the norm ||/3j|| 
in (|9]). Hence not only a larger gradient but also a larger root-A^-consistent 
estimate of the jth block leads to milder shrinkage. 

4. waveshrink for s = 1 and X an orthonormal wavelet matrix: soft-waveshrink 
for V = \ and hard- waveshrink when — )■ oo. 



5. truncated James-Stein for = 2, A = \/ P — 2 and a block of size P > 2. 

6. block thresholding for wavelet smoothing for s = 1, p = 2, groups of size 



L > 1 and X an orthonormal wavelet matrix (Cai, 1999). 



For a proof, observe that the SBITE algorithm corresponds to the shooting 



algorithm (Fu, 1998) for lasso, the BCR algorithm (Sardy et al. , 2000) for basis 



pursuit (Chen et al. , 1999) and to the iterative algorithm of Yuan and Lin (2006) 



to minimize penalized least squares problems of the form 



1, 

min - 

/3 2' 



Y-X/3 



.tt 11/3*1 



u-l 



11/3. 



(10) 



Note that the last three estimators above (4, 5, 6) converge after one iteration. 



and ( 10 ) can also be solved by another class of iterative algorithms developed for 



inverse problems (Daubechies et al. 2004). 



Hence SBITE provides a new interpretation of lasso as a sequence of tests based 
on the block gradient of the likelihood evaluated at successive null hypothesis 
Hq : /3j = given the values of the other coefficients, until an equilibrium is 
reached. A legitimate question addressed in the following section is whether such 
an equilibrium can be reached at a unique point. If so, SBITE is defined uniquely. 



3.3 Uniqueness 

Lasso (s = 1) does not necessarily define a unique estimate if the kernel of the 
regression matrix X is not the singleton (Sardy, 2009). On the contrary SBITE 
is uniquely defined under a milder condition when s > 1, as stated in Theorem 1 
below. We first give its fixed point definition. 

Smooth block iterative thresholding estimator (SBITE) (fixed point definition). 
Choose a threshold A > 0, a shrinkage parameter u > 1 and a smoothness param- 
eter s > 1. Let /3 be a root-A^-consistent estimate of f3. SBITE is a fixed point 
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to the SBITE algorithm, which is the solution to the set of P nonlinear equations 



||/3::r-i||V.,/(X/3(^-)-°;y 



|MLE 



where /3'^^^ is defined by (6 ), and each /3 • is a vector of length pj with P = ^2^=1 



Pr 



Despite being highly non-linear and employing a non-convex thresholding func- 
tion, these equations define SBITE uniquely in the Gaussian case when s > 1. 



Theorem 1: For the Gaussian likelihood, the solution to (11) with smoothness 
parameter s > 1 is uniquely defined for all matrices X = [Xi 
XjXj are positive definite matrices for all j = 1, 
differentiable with respect to the data. 



, Xj] such that 
J. It is moreover continuously 



Note that the condition is milder than X'^X being positive definite; for unit 
block size, its means that each column of X must be different from the zero- vector. 
Convergence of the SBITE algorithm is proved when s = 1 for the Gaussian 



likelihood (Fu 1998 Sardy et al. 2000) and for more general likelihoods (Sardy 



and Tseng 



2004). Convergence to the unique fixed point has always been observed 



when s > 1, but remains to be proved. 



3.4 Equivalent degrees of freedom 



SBITE is governed by two regularization parameters A and z/, and a smoothness 
parameter s. For s = 1 and for the Gaussian linear model with mean fj, = X(x and 



P variables grouped into blocks of unit size, Zou (2006) selects the two regular- 



ization parameters A and u of adaptive lasso by cross-validation, a rule known for 
its high computational cost and instability. This section derives instead the Stein 
unbiased risk estimate for any combination of the three parameters (A, z/, s). 

Stein (1981) showed that for an estimator of the form ft = g(Y) + Y and 



unit variance, then the quadratic risk can be estimated unbiasedly if g is almost 
differentiable. For SBITE with blocks of unit size, we have g(Y; A, u, s) = Yl + 
X f3x^^-sC^) ~ Y, where /^^^ ^.^(Y) is the solution to (11) for Gaussian likelihood. 
SBITE is almost differentiable for s = 1 and differentiable for s > 1, so the Stein 
unbiased risk estimate (SURE) for SBITE is 



N 



SURE(A, u; s) = RSS(Aa,.,s) + N + 2J2 dgn{Y; A, s)/dYn 



(12) 



n=l 



where the last term is the so-called equivalent degrees-of-freedom. SURE involves 
the partial derivatives a^„(Y; A, z/, s)/9r„ = 1/N + x™™ ■ VnPx,u,sC^) ~ 1' 
n = 1,. . . ,N, where Vn$\usC^) the derivatives of fB^i^sC^) with respect to 
Yn, and xj!j°" is the nth row of X. The following theorem states that Vn$\usC^) is 
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explicitly defined as solution to a full rank system of linear equations when s > 1 . 
Hence the risk of SBITE can be estimated unbiasedly for all regression matrix. 

Theorem 2: For a given pair A > and z/ > 1 and for a linear estimate 
f3 = AY (e.g., least squares or ridge regression) where the entries of A are noted 
the gradient of the SBITE estimate f3xusC^) with respect to Yn is the solution 



to a system of linear equations (26) that is full rank when s > 1, regardless of the 
existence of a kernel for X, for n = 1, . . . , N . 

Interestingly also from Theorem 2, letting the smoothness parameter s tend to 
one leads to the equivalent degrees of freedom of adaptive lasso; if moreover z/ = 1, 
then the solution to is hfo = ((X^o)TX^o)-1(x^°™)T. Hence, we see that 

N N _ _ 

Y,dgn{Y;X,u,s)/dYr, = 1 + ^ x^^ ■ ((X^")^X^°rHx: 



n=l 



n=l 



l + \To\-N, 



where |Xo| is lasso's degrees of freedom previously found by Zou et al. (2007). 



The smoothness parameter should not be considered as a third regularization 
parameter like A and z/, but more like a device to bring smoothness to the estimator 
and combat the increasing erraticity of the two-dimensional SURE function as u 
grows. Section quantifies SURE's erraticity tempered with the smoothness 
parameter s. The smoothness parameter should not be too large however, since 
it contradicts the goal of a large u to approach hard thresholding, and since the 



constant of the oracle inequality (21 ) of Theorem 4 increases with s. A good trade- 
off is for instance siu) 



2 log v + l (see Theorem 3 below). Figure |2] illustrates the 
gain in smoothness by calculating SURE for the prostate cancer data with P = 8 



covariates (Tibshirani, 1996), and by comparing the smoothness of the estimated 



risk either with adaptive lasso (left) or with its smooth extension (right). Both 
risk estimates are unbiased, but the second is less erratic thanks to s > 1. 

We have assumed unit variance. In practice, one can either estimate the vari- 
ance and rescale responses to have approximate unit variance, or, in the spirit of 
generalized cross validation (Golub et al. , 1979[ ), one can use generalized SURE 

GSURE(A, z/; s) = .r—j^ — ' ' , (13) 



:i-jrELid9n{Y;X,u,s)/dY^y 



To minimize SURE or GSURE over (A, z/) for s = 21ogz/ + 1, our strategy 
consists in minimizing it first for s = 1 (i.e., adaptive lasso) which can be done 
efficiently on a fine grid thanks to lars ( [Efron et al. 2004). This provides a neigh- 
borhood for A, namely [A'-^^^^IO, lOA'-*^^^], over which we then minimize SURE 
on a more local grid for smooth adaptive lasso {s > 1) calculated with the SBITE 
algorithm. This strategy provides a good and efficient selection of the pair of 
hyperparameters (A, u), as demonstrated by Monte-Carlo below. 
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Figure 2: Prostate cancer data: Stein unbiased risk estimate as a function of A and 
z/ for adaptive lasso (left, s = 1) and smooth adaptive lasso (right, s = 21ogz/ + 1). 



3.5 Monte-Carlo simulation 



We replicate the Monte-Carlo simulation of Zou (2006) with P = 8 covariates with 
corresponding coefficients either sparse a = (3, 1.5, 0, 0, 2, 0, 0, 0) for Model 1 with 
e {20, 60}, or not sparse a = (.85, .85, .85, .85, .85, .85, .85, .85) for Model 2 with 
G {40,80}. The covariates x„ are i.i.d. Gaussian vectors with pairwise correla- 
tion between Xn,i and Xnj given by cor(i, j) = (.5) The noise is Gaussian with 



standard error a G {1,3,6}. Like Zou ( 2006[ ), we consider the relative prediction 



error RPE = E [x^^^. {a ( [X, Y] training) — ck}]/c"^, where the expectation is taken 
over training and test sets, and response. Note that we exactly calculate RPE 
given the training set by using knowledge of the distribution of the covariates (Zou 
relies on 10,000 test observations instead). This predictive measure is reported in 
Table [T] to compare lasso, adaptive lasso and smooth adaptive lasso. To compare 
estimators fairly, we consider the same selection rule for all, here two-fold cross- 
validation. Since the estimators are used on the same 100 training sets, then the 
numbers we see in the tables reveal significant differences, even though marginal 
standard errors are large. We observe that SBITE improves significantly over lasso 
and adaptive lasso when the underlying model is sparse and the noise is small; the 
estimation is slightly worse for the non-sparse model. 

We also consider SURE as a selection rule for lasso and its smooth adaptive ver- 
sion SBITE, that we compare based on their relative prediction error conditional on 
the covariates of the training set, namely RPE | X = Y.n=i E[xJaining,n.{'^([^5 Yjtraining)" 
Q;}]/cr^, where the expectation is taken over the response only. Although less use- 
ful in practice unless prediction is sought at the same locations as the training set, 
this measure helps quantify the improvement of using smooth James-Stein thresh- 
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olding. Table [2] reports the results, which shows a systematic gain of SBITE over 
adaptive lasso. Lasso is often better for the non-sparse model. 

Finally Table |3] reports the number C of selected nonzero components and the 
number / of zero components incorrectly selected. We observe that the selection 
is correct when the noise is small (a = 1) with SBITE and adaptive lasso using 
SURE, but that false detection grows with noise. 



4 Block canonical regression 

We consider block canonical regression, that is when the regression matrix is the 
identity and the coefficients are organized in blocks of size Q, namely. 





n 


] 




1 






( 




\ 


Yn = + e„ with Y„ = 












and e„ = 










\ n 


) 






) 




\ 


AQ) 

n 


J 



(14) 

for n = 1,...,N, where the noise is i.i.d. Gaussian (independent between and 
within sequences). If the standard deviation is not known, then Donoho and 



Johnstone (1994) proposed an efficient estimate based on the median absolute 



deviation; another possibility is to use generalized SURE (13). This setting applies 
to the gravitational wave bursts detection problem ([s]) of Section |2] with Q captors. 
Since X is the identity, rescaling has no impact, that is X = X and /3 



a. 



Block sparsity assumes most blocks q;„ are the 0-vector. SBITE (11) achieves 
block sparsity and has the closed form expression 



in this canonical setting. 



'1 - 



I n||2 



n 



1, 



(15) 



Table 1: Zou's Monte-Carlo simulation with 100 training sets: median RPE using 
two- fold cross-validation to select the hyperparameter(s). 



Model 1 
lasso 
adaptive 
SBITE 



iV = 20 



TV = 60 



iV = 20 



lasso 



0.367, 
0.360 
0.328 



0.048) 
(0.051) 
(0.046) 



0.089 
0.052 
0.054 



(0.009) 
(0.009) 
(0.009) 



0.419 
0.435 
0.424 



(0.069) 
(0.057) 
(0.056) 



iV = 60 
0.089(0.008) 



N = 2{) 



iV = 60 



0.085 
0.085 



(0.009) 
(0.009) 



0.369 
0.308 
0.330 



(0.021) 
(0.021) 
(0.020) 



0.096, 
0.097 
0.098 



0.010) 
(0.011) 
(0.010) 



Model 2 
lasso 
adaptive 
SBITE 



iV = 40 



AT = 80 



iV = 40 



lasso 



0.238(0.014) 
0.238(0.015) 
0.238(0.015) 



0-104(0.005) 
0.104(0.005) 
0.104(0.005) 



0.231(0.020) 
0.233(0.021) 
0.240(0.021) 



iV = 80 
0.108(0.005) 
0.108(0.005) 
0.109(0.005) 



= 40 



iV = 80 



0.163(0.010) 
0.181(0.010) 
0.172(0.010) 



0.087(0.005) 
0.091(0.005) 
0.090(0.005) 
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Table 2: Zou's 
at the training 



Monte-Carlo simulation with 100 training sets: median RPE 
covariates X using SURE to select the hyperparameter(s). 



X 



1 



Model 1 
lasso 

adaptive lasso 
SBITE 
Model 2 
lasso 

adaptive lasso 
SBITE 



iV = 20 



TV = 60 



iV = 20 



iV = 60 



= 20 



iV = 60 



0.264 
0.231 



0.228 



(0.021) 
(0.023) 
(0.023) 
iV = 40 



082 
075 



(0.008) 
(0.008) 
(0.008) 

TV = 80 



0.258 
0.280, 



065 



0.279 



(0.020) 
(0.023) 
(0.025) 
iV = 40 



0.082 
0.090 



0.084 



(0.008) 
(0.008) 
(0.008) 

iV 80 



0.219 
0.289 



0.255 



(0.017) 
(0.019) 
(0.019) 

= 40 



0.082 
0.096, 



0.097, 



(0.007) 
0.007) 
0.007) 

iV = 80 



0.187 



0.191 



(0.010) 
(0.011) 



.092 



.092 



(0.004) 
(0.004) 



0.187 



0.237 



(0.010) 
(0.013) 



0.090 



0.127 



(0.004) 
(0.006) 



0.137 



0.172 



(0.008) 
(0.009) 



0.075 



0.1 
0.1 

0.191(0.011) 0.092(0.004) 0.219(o.oi3) 0.113(o.oo6) 0.164(o.oo8) 0.099(o.oo4) 



0.105 



(0.003) 
(0.004) 



Table 3: Median number of Selected Variables for Model 1 with n = 60 





a = 


1 


a 


= 3 




C 


I 


C 


I 


Truth 


3 





3 





Lasso'f 


3 


2 


3 


2 


Adaptive lasso^ 


3 


1 


3 


1 


SCADt 


3 





3 


1 


Garotte^' 


3 


1 


3 


1.5 


Adaptive lasso SURE 


3 





4 


1 


SBITE SURE 


3 





5 


2 



''' Results taken from the Monte-Carlo simulation of Zou 



(2006). 



4.1 Total variation of SURE 

The Stein unbiased risk estimate also has the closed form expression SURE(A, z/, s) 

E^=l Pn((A, Z/, S), CXn) with 



P„((A,Z/, s),(Xn) = {1 - (1 



Q 



\s T 2 



n\\2 



d{a. 



n I X,u\s 



(<?) 



where 



X,u:s 



dYr 



(q) 



l|Y„||: 



+ 1- 



l|Y„||; 



_ 1 dYr 

if ||Y„||2 < A 
if IIYJU > A 



(16) 



For thresholding functions employing no smoothness, that is s = 1 here, SURE has 



N discontinuity points as a function of A for a fixed u. Indeed ( 16 ) is discontinuous 
at A = II Y„||2 for all n = 1, . . . , when s = 1, and the size of each jump is equal 
to 2i/. We had already observed on the left graph of Figure [2] that the larger 
u the more erratic the SURE surface for the prostate cancer data when s = 1. 
There are two negative consequences for the selection of A and u. First the SURE 
two-dimensional surface will have minima that will be difficult to localize from 
an optimization point-of-view. Second, the location of the global minima will be 
sensitive, in particular with large u, to changes in the data Y„. 
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Donoho and Johnstone (1995) studied the asymptotic properties of selecting 
A by minimizing SURE for u = s = 1. To show their SureShrink estimator is 
optimally smoothness adaptive, a key ingredient is the deviation of SURE around 
its mean when u = 1. Theorem 3 below shows that the total variation of SURE 



Subbotin I penalty: v=.4 
Thresholding function 



James-Stein: v=10, s=] 
Thresholding function ior'A-3 



Smooth James-Stein: v=10, s=2log v+1 
Thresholding function for>.=3 




Figure 3: Thresholding functions and corresponding SURE for Q = 1 and i/ = 10 
on simulated data of length N = 1000. Left: Subbotin ii, penalized least squares; 
Middle: James-Stein (s = 1); Right: smooth James-Stein (s = 2\ogh' + 1). Top: 
thresholding functions with parameters chosen to approximate hard thresholding. 
The left one is discontinuous at the threshold, but with a small slope at the thresh- 
old; the middle one has a discontinuous derivative at the threshold; and the right 
one has a smooth change of derivative at the threshold. Bottom: corresponding 
Stein unbiased risk estimate (least smooth curve) and true loss (smoothest curve). 



grows when u increases, and that employing smooth James-Stein thresholding with 
a smoothness parameter larger than one tempers this erratic effect by removing the 
jumps and decreasing the erraticity of SURE. Figure |3] illustrates the advantage 
of increasing s when u gets large on simulated data. We observe that while the 
two thresholding functions for s = 1 and s > 1 only differ slightly, the latter is 
smoother near the threshold value and the corresponding SURE curve is less wiggly 
around the true loss. Section 4^ reports results of a Monte-Carlo simulation that 
quantifies the improvement in mean squared error obtained by adding smoothness. 

A measure of erraticity of SURE that is defined not only for s > 1 but also for 
s = 1 is its total variation as a function of A. The total variation of a function / in 
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the space of functions of bounded variation (that is, not necessarily continuous) is 
TV(/) = supX^j |/(^i+i) ~ /(-^i)!) where the supremum is taken over all possible 
partitions [Aj,Aj_|_i], j = 1,...,M, of the domain of /. (If / is moreover abso- 
lutely continuous, then TV reduces to the more conventional smoothness measure 
TV(/) = \f'{X)\dX.) The following theorem quantifies the erraticity of SURE 
and shows the tempering effect of the smoothness parameter s. 

Theorem 3: Consider SURE(A; u, s) = Y.n=i Pn((A; i^, s), q;„) as a function of A 
for Q = 1. Its total variation for a given v >1 satisfies 

N N 

Tv(^=i)(SURE) = i:Y^'=^\pn) > J2 TV(">^)(p„) > TV(^>i)(SURE). 

n=l n=l 

Moreover erraticity increases less with s > 1 when u or \Yn \ grows since -^TV^^'^^\pn) < 
|_TV(^=i)(p„) and ^TV(^>i)(Pn) < ^Ty'~'='\pn) for F„ > 0. In particular 
when F„ ^ and for u large, then |;TV('>^)(p„) -> 4(1 - 1/s)"-^ > 4exp(-l) 
for s fixed. Letting s grow slowly with u, for instance s(z/) = 21ogi/ + 1, then the 
lower bound 4exp(— 1) is reached to lower erraticity most. 



4.2 Universal threshold and information criterion 



To derive a universal threshold (Donoho and Johnstone 1994) and an information 



criterion for SBITE, we approximate below the distribution of the smallest thresh- 
old Xy that, for a sample y = (Yi, . . . , Y^) of size N, sets to zero all N blocks of 
length Q when the true underlying model is made of zero vectors. Controlling the 
maximum of Xy then leads to a finite sample Aat q and asymptotic X^^q universal 
thresholds, and a prior distribution tta for A. 

Assuming Nq(0, Iq) for = 1, . . . , A^, we seek the smallest threshold 

Xn^q such that SBITE estimates the right model with a probability tending to one: 



P((«i); 



■N,C, 



0, 



0) 



P ( max I 

n=l,...,Af ' 



Y 



n||2 



2 < X2 



The distribution of M, 



N 



N 

max^Li 



|Y„||2, where ||Y- 



1 2 i-i-d- 



n\\2 



Xq 



(17) 

r(Q/2,l/2), is 



degenerate. Extreme value theory provides proper rescaling of for c]^{Mn — 
dN{Q)) — >d Gq{x), where Gq{x) = exp(— exp(— x)) is the Gumbel distribution, 
Civ = 2 and dj^{Q) is the root in ^ to 



logAr-logr(g/2) 



g/2)log(e/2)+e/2. 



The normalizing constant c?jv(Q) = 2(logA^ + (Q/2 - l)loglogA^ - logr(g/2)) 



given by Embrechts et al. (1997, p. 156) for the Gamma distribution is the asymp- 
totic root of (18), which provides a good Gumbel approximation when N is large 
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compared to T{Q/2). In that case we define the asymptotic universal threshold 
Xn,q = j2{\ogN + {Q/2)log\ogN - logr(Q/2)), for which (IitI is satisfied since 



max 

n=l,...,Ar 



1 2 — ^Ar,( 



Go(loglogAr) ^ 1 - 1/logA^ 



N- 



1. 



(19) 



Note that we get back the standard universal threshold a/2 log up to a small term 
for Q = 1, and the universal threshold of Sardy (2000) for denoising complex- valued 
signals for Q = 2. When Q gets large however, the proposed normalizing constant 
dniQ) is too far from the exact root to provide a useful approximation, so we find 



the root d]\i(Q) of (18) numerically. The finite sample universal threshold is then 
defined as 



Xn,q = ydN{Q) + ciyioglogN with ciAr(Q) root of (18) 



(20) 



to have the same rate of convergence for all Q with Xn,q in place Xn^q in (19). 

More than a bound the asymptotic Gumbel pivot for leads to a prior 
distribution -Fa (A) = Go{{X'^ — dj^{Q))l2) of the threshold A to reconstruct true 
zero vectors from noisy measurements. When s = 1, Bayes theorem provides the 
joint posterior distribution of the coefficients and the hyperparameters. Taking its 
negative logarithm leads to the following information criterion in the spirit of the 
sparsity 1^ information criterion SL^^IC (Sardy, 2009). 



Definition. Suppose model (14) or model (3.1) of Cai ( 1999[ ) holds. The spar- 
sity weighted ^2 information criterion for the estimation of (cki, . . . , ckjv) and the 
selection of (A, v) with SBITE (15) for s = 1 is defined as 



N ^^1 
O H l|Y„ - OLnWl + A'' ^ ||i.^l l|Q:n||2 



TV 



SL2 IC(q;i, . . . , ckat, a, z/) 



n=l 



n=l II "I 



-QNvXogX - log7rA(A; t^^q) - \ogT[y{v), 



where i^y is a prior for v that we choose Uniform on [1, 00), 7r(A; r 
Fa(A; r) = G'o((A^/r^ — dN{Q))/2) and r is calibrated to r^g = 
to match the asymptotic model consistency when q;„ = for = 1 



An 



F'{X; r) with 
^qHQNi. + 1) 
...,N. 



In practice, one minimizes SL^'IC like AIC or BIG to both select the hyper- 
parameters (A, u) and estimate the sequences q;„, n = 1, . . . ,N. The information 
criterion could also be derived for s > 1 if we knew the definition of SBITE as a 
penalized least squares, which is an open problem. 



Smooth blockwise iterative thresholding 



16 



4.3 Oracle inequality 



Candes (2005) provides an interesting review on oracle inequalities. Here we derive 



an oracle inequality for SBITE employing smooth James-Stein thresholding when 
the block size Q > 2 is fixed. Cai (1999) derived an oracle inequality for block 



sizes increasing with the sample size. The £2 risk for model ( 14 ) is R(q:,c >:) 



J2n=iPn{f^n) = J2n=i'^\\^n — CKnUl- Followiug Douoho and Johustouc (1994), the 



oracle predictive performance of the block diagonal projection estimator a.r 
6nYn, where (5„ G {0, 1} is 




Hence the oracle hyperparameters are 5* = 
corresponding oracle overall risk is R*(5, a.) 



„2 if S 

n|l2' " 



if 5r, 



0, 

1. 



mm 



= 1, . . . , A^, and the 
^Q). The following 



theorem extends the oracle inequality obtained by Donoho and Johnstone (1994) 
and Zou (2006) for Q = s = 1 to block thresholding with Q >2 and s > 1. 



Theorem 4'- For any fixed Q > 2, there exists a sample size iVn such that, for 
all N > No and with the universal threshold Xn,q defined in (20), then SBITE 
defined by (jTsj) for z/ > 1 and s > 1 achieves the oracle inequality 



R(«f™ ,a) < iQ + l + 2us + c.,.,QXi^Q){Q + R*{S,cx)), 



(21) 



where Cu,s,q = max(l + 



s^) and A^, 



2 log AT + Q log log AT - 2 log r(g/2) . 



This result shows we can mimic the overall oracle risk achieved with N oracle 
hyperparameters within a factor of essentially q with the single hyperparameter 
Xn,q- The smallest sample size A^^o for which the inequality holds is quite small in 
practice; more work is needed to get a tight expression. Note that for s = 1, the 



inequality differs from Zou (2006) which had the z/-term in the denominator (this 
seems to be due to an error in djl*{\)/dyi right above (A. 13) p. 1427). This result 
shows that increasing v or s increases the oracle inequality constant. But this does 
not prevent the estimator with z/ > 1 to be oracle (Zou, 2006), which is not true 
for lasso with u = 1. Likewise using a larger s improves predictive performance in 
practice although the oracle inequality constant increases. 



4.4 Application to wave burst detection and estimation 

We employ SBITE blockwise and levelwise to Q = 3 concomitant time series of 
length T = 2^"^ (about 3.27 seconds of recording) to detect gravitational wave 
bursts, as described in Section [2] Taking J = 4 in the wavelet expansion (|4]), 
SBITE has a total of 22 hyperparameters to select (11 levels with two hyper- 
parameters each). Data are pure electronic noise, so we add three proportional 
so-called "injections" at time t = 1500, to mimic a wave burst. 
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Captor 1 



Captor 2 



Captor 3 



50D0 100D0 



5000 1D000 15D00 



5DD0 100D0 1500D 



50D0 100D0 



5000 IDOOO 15D00 



5DD0 100D0 1500D 



Figure 4: A wave burst injection at time t = 1500 added on three real captors 
noise (columnwise). Top: Time series of length T = 2^^. Middle: SBITE employing 
SURE levelwise and blockwise. Bottom: univariate smoothing per captor levelwise. 



Figure |4] shows the data (first line) for each captor (columnwise), the SBIT 
estimate (second line) and estimates employing a univariate smoothing per cap- 
tor (third line). We observe that, as opposed to univariate smoothing, blockwise 
smoothing detects the injection and has no false detection except right before time 
t = lO'OOO. Figure |5] zooms around the injections that are three times larger on the 
second captor, and five times smaller on the third captor. We see that blockwise 
estimation of the injections is better than coordinatewise. 



4.5 Monte-Carlo simulation 



We reproduce the Monte-Carlo simulation of Johnstone and Silverman (2004) for 
Q = I captor to estimate a sparse sequence of length N = 1000 and of varying de- 
grees of sparsity, as measured by the number of nonzero terms taken in {5, 50, 500} 
and by the value of the nonzero terms fi taken in {3, 4, 5, 7}. Table |4] reports esti- 
mated risks of four estimators: SBITE with [s > 1) and without {s = 1) smooth- 
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ness, Subbotin i^, penalized likelihood (Sardy, 2009) and EBayesThresh (Johnstone 



and Silverman, 2004). The results clearly show the superiority of SBITE with 
SURE thanks to more smoothness. In case of extreme sparsity, only SBITE using 
the SL^IC information criterion and EBayesThresh perform better; this drawback 



of SURE has been explained by Donoho and Johnstone (1995 Section 2.4). 

We also perform a Monte-Carlo simulation with now Q = 3 concomitantly 
observed sequences: all three underlying sequences are identical in the location of 
the non zero entries, but not in their amplitude. Since three sequences carry more 
information than a single one, we may hope to distinguish noise from signal with 
a smaller signal-to-noise ratio, so we consider value of the nonzero terms fixed to 

= 1, = 2 and jj,^ = ji taken in {3,4,5,7}. The estimated risks (divided by 
Q to allow some comparison with Q = 1) are reported in Table |4| SBITE with 
smoothness (s > 1) again performs best overall, while the SL^IC selection rule 
performs better than for Q = 1, even more so with very sparse sequences. 



Captor 1 




Captor 2 




Captor 3 



20 40 



20 40 




20 40 



£ o 










1 




20 40 



20 40 60 

time 



20 



Figure 5: Zooming on the "injection" represented by the dotted line (the y-scales 
differs between captors). The estimates are plotted with a continuous line: block- 
wise smoothing (first line), and a univariate smoothing per captor (second line). 
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Table 4: Monte-Carlo simulation for a sequence of length = 1000. Average total 
squared loss of: SBITE using smooth SURE; SBITE with smoothness parameter 
fixed to s = 1 using SURE or SL^IC; the Subbotin(A, z/) posterior mode estimator 



using SURE or SLj,IC; and the EBayesThresh estimator with Cauchy-like prior. 



Number nonzero 
Value nonzero fj, = 






5 






50 








500 






3 


4 


5 


7 


3 


4 


5 


7 


3 


4 


5 


7 


Q = 1 captor 


























SBITE s > 1 


























SURE 


37 


37 


26 


15 


202 


165 


109 


65 


826 


752 


614 


521 


SBITE s = 1 


























SURE 


45 


42 


31 


24 


213 


174 


119 


77 


848 


760 


624 


551 


SL^IC 


39 


41 


23 


6 


380 


389 


213 


54 


3350 


2688 


1532 


532 


Subbotin 


























SURE 


39 


35 


27 


25 


232 


167 


107 


97 


1239 


794 


607 


533 


SL^IC 


38 


36 


19 


9 


356 


296 


132 


59 


849 


831 


839 


859 


EBayesThresh 


37 


36 


19 


8 


268 


177 


104 


77 


924 


899 


831 


743 


Q = 3 eaptors 


























SBITE s > 1 


18 


17 


14 


8.7 


106 


94 


75 


56 


615 


615 


558 


510 


SBITE s = 1 


























SURE 


22 


21 


18 


16 


110 


98 


80 


63 


612 


613 


563 


516 


SL™IC 


19 


18 


11 


5.7 


181 


168 


109 


52 


1600 


1350 


807 


548 



5 Further extensions 



Smooth James-Stein thresholding ( 15 ) relies on the £2 norm, like most block thresh- 



olding we are aware of. This measure may not be appropriate in certain applica- 
tions. Indeed if one wants to measure departure from the zero vector in a sense 
that all entries must be different from zero, then ||Y„||2 in (15) should be replaced 
by minq=i^...^Q l^n'^^l leading to robust SBITE: 



'1 - 



mm 



'7=1, 



1 n 



(9)1 



a simple calculation leads to \n,q = yS/QlogA^ for its corresponding universal 
threshold. Other quantiles, or a norm like the li norm, could also be considered. 
Importantly also, the use of a common threshold A for each block implicitly assumes 
blocks of equal size; if not, then the threshold \{pj) must grow with block size pj. 

For the group lasso. Yuan and Lin ( 2006[ ) provided an approximate degrees of 
freedom. One can instead follow the derivation of Theorem 2 to derive the exact 
one, as we did in Section |3. 4 for adaptive lasso. Consider the smooth generalization 



of adaptive group lasso (10), defined as solution to 



" s- 



with 



c, = i-A7(p^.iinis,ii2: 



J, (22) 



where the coefficients vector is segmented as /3 = (/3, . . . , f3j) G in J blocks of 
respective size pj such that J2j=i Pj = P- Note that ( [22| is the smooth extension of 



Yuan and Lin (2006, (2.4)) taking Kj = Ip using their notation. Deriving SURE 
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in that setting requires calculating the nonzero elements of the block gradient 
Vn/3(Y) of the estimated vector defined by (22) with respect to the data Yn for 
n = 1, . . . ,N. Following similar derivation as in Appendix [B| one finds they are 
defined by a set of linear equations of the form (26), that has a unique solution 
when s > 1. Hence the exact equivalent degrees of freedom of smooth adaptive 
group lasso can be calculated, and in particular for adaptive lasso by letting the 
smoothness parameter tend to one. 



6 Conclusions 

We developed SBITE variable selection defined as the fixed point of an iterative 
sequence employing the smooth James-Stein thresholding function. SBITE can 
be employed blockwise or coordinatewise, and can control sparsity, shrinkage and 
smoothness by means of three parameters. For any combination of these three 
parameters, we have derived the Stein unbiased risk estimate that is smoother 
the larger s for a better selection of the regularization parameters. Letting the 
smoothness parameter tend to one, we obtained the equivalent degrees of freedom 
of lasso, adaptive lasso and group lasso. For block canonical regression, we derived 
a universal rule, an information criterion and an oracle inequality. The estimator 
is promising for gravitational wave burst detection and estimation: we are cur- 
rently conducting an analysis with physicists on several months of recordings to 
quantify type I and type II errors, as well as false discovery rate. Also, in the 



spirit of Park and Hastie (2007), generalized linear models could be regularized 
via smooth James-Stein thresholding. More generally, SBITE can be employed in 
other settings than regression to provide both sparsity and smoothness. 
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A Proof of theorem 1 

Let s > 1 and z/ > 1 be fixed. Let bj = W^*]]"'^ > 0, Cj^k = XjX^, Vj = 
XJY — J2k^j Cj^kf^k for all blocks j = 1, . . . , J. For the Gaussian likelihood, (11) 
is given by /3j = {1 - ^^^y+Crjvj for j = 1, . . . , J. Let F : IR^ ^ IR^ defined 
by F = (/i, ...Jj) with /,(/3) = f3^ - {1 - ^};C^ir„ where /, : IR^ ^ IRW 
for j = 1, . . . , J. For s > 1, F is differentiable on IR^, which is a rectangular 



region. The fundamental global univalence theorem of Gale and Nikaido (1965) 
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states that F is globally univalent on IR^ provided its Jacobian J(/3) is a P-matrix 
(here "P" stands for "positive") for every f3 G IR^. A (not necessarily symmetric) 
real square matrix is a P-matrix if all of its principal minors are positive. If so, 
then the 0-vector in particular has at most one preimage by F and the smooth 

block iterative thresholding estimate f3 = F~^{0) is unique. 

To prove the Jacobian J(/3) of F is a P-matrix, let us determine its entries. 
Clearly J{f3) has ones on its diagonal since does not depend on f3j. For its other 
entries, consider any point /3 G IR^, and let Xq be the set of indices j for which the 
inequality &j||rj|| < X'^ is true; let po = J2jeXoPj ^^nd jo = \^o\- Permuting variables 
if necessary, the satisfied inequalities are for j = 1, . . . , jo- Hence, the first po lines 
of the Jacobian at f3 are the po x P matrix [Ip^ ^pox(p-po)]- For the remaining 
blocks j G {jo + 1, . . . , J}, the Jacobian is a block matrix with blocks 

Jhk = \ '^^^i^ ._, for A; = jo + l,...,^, 

where dj = with 1/wj = 1 — G (0, 1). It is straightforward to show 

that 1 < dj < oo when s > 1. Hence the Jacobian is 



where B is some (P — po) x pq matrix, D-^° = diag^CjJ /dj, j = jo + 1, . . . , J), 
and Cl\ = (X^o)TX^o + H. Since H = diag{{dj - l)Cjj, j = jo + l,...,J) is 
positive definite if Cjj = XjXj > (recall that dj > 1 when s > 1) for all 

j = 1, . . . , J, then is a positive definite matrix when s > 1. Consequently, 
|J(/3)| = I -D"^" 1 1 C^'} I > 0; positivity is also verified for all principal minors of J{(3) 
that have the same structure as (23). So F is an injective function and has 
a unique preimage. The Jacobian being invertible, the implicit function theorem 
guarantees the preimage is also continuously differentiable with respect to the data. 



B Proof of theorem 2 



For Gaussian likelihood, the solution to (11) is the system of nonlinear equations: 



/5p 



{1 



with rp = xJY - XpXg/3g, p = 1, . . . , P. 



(24) 



Hence one finds 
^^(Y) 



if/3p(Y) = 0, 

IIjj. ||2 iPpi^-^np 

xJX_pV„/3(Y)) +Mp) else, ^ ' 
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where 

s{u - l)apn{wp - l)rp 1 - s + swp 
Up = ^ Vp = and l/u7p = 1 — ^ 



p;w; w; 1/3; 



u-l 



Let To = {j5 G {1, . . . , P] : /3p(Y) = 0}, and let be the columns of X with an 



index in Xq. Rewriting (25), the entries of V„/3(Y) =: h„ are 

(V„^(Y))p = 



p G Xo 

'-n.p P G Xq 



where are solution to the following system of |Xo| linear equations: 



xJX^oDj«h^« =a;„,p + ^ forallpGXo. (26) 



^ - - Vp 



Here is the identity matrix except that its pth diagonal element is D^'^p = Vp^ 
The matrix of the linear system (26) is {X-^°)^X-^° which diagonal elements are 



multiplied by -0^°^. Moreover Wp > 1, so all D^'^j > 1 since f{w) = w'^/{l — s + sw) 
satisfies /(I) = 1 and f'{w) > for w > 1. This guarantees existence of a solution 
when s = 1 if the column of X are linearly independent, and when s > 1 
otherwise (i.e., s plays the role of a ridge parameter). 



C Proof of theorem 3 

For s = 1, each p„ is strictly increasing from A = to A = and is then con- 
stant after a jump of size 2u. So TV('=^)(Pn) = Y^ + Au-2 and TV(*=^)(SURE) = 
E^=iTV('=^)(p„). For s > 1, the triangular inequality gives TV(">^)(SURE) < 
I^i^Li TV^"*^^-* (/)„), and simple calculations lead to TV*-"*^"^ ■*(/)„) = 2p„((A„, i/, s), a„) — 
Y^, where A„ is solution to x„ = (1 — AJ^/lYnl'^) with Xn G (0, 1) the unique root to 

d 

—pj(x]u, s),an) oc -sY^(l-x')x'~'^ + (s-l)x'~'^(us(l-x)+x)+x'~'^(-us+l) =0 
ox 

i.e., Y^x{l - x') = u{s - 1) + x{l - us). Hence TV^'>^\pn) = Y^ + Avi'-^ - 
2 - 2F„2£2s < TV('^=i)(Pn). Moreover J^TV(^>^)(p„) = 4r„(l - i^f - 2Y^ < 
^TV^'='\pn) = 2Yn for r„ > 0, and £W'>'\pn) = ^sx^^l - 5„) < 4(1 - 
l/sy-^ < £TV^'=^\pn) = 4. At the limit when y„ tends to zero and for large z/, 
we have -^TV^^'^^\pn) = 4(1 — ^)'*~^ > 4exp(— 1) and the lower bound is reached 
as u grows if for instance s = 2 log u + 1. 
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D Proof of theorem 4 



The SBIT estimator {ctn)x,u;s defined in (15) with Q fixed has risk 



Q + E||(a„)A,.;. - YJl - 2Q + 2E(Y„ - a„f{an)x,u;s 

(27) 



-Q + E||(a„)A,,;,-Y„||^ + 2^E 

9=1 



for all = 1, . . . , A^, where we used Stein's lemma for the last term, and where 

if IIYJU < A 



2_j (Y^'^f ll^n||2 

"1 (y^'^ni - - wk^'V if iiY„ih>A 



and d{ctx,u;s)n/dYj^'^^ is given by (16). From (27) and using the inequality (1 — (1 



eY))^ < s^e^ for < e < 1 and s > 1, one gets two inequalities. First we have 

p„((A, u, s), a„) < -Q + A2P(||Y„||2 < A) + s'X^P{\\Y^\\2 > A) + 2{us + g)P(||Y„||2 > A) 
< Q + 2US + s^A^ 

' (Q + 2us + s^\^){Q/N + Q) if Q > 1 



- \ {Q + 2vs + s^\^){Q/N +\\oLn\\l) if||a„||2>l " 
Second, we show below that 

p„((A,z/,s),a„)<(l + g + A^,Q)(^^)(g/iV+||a„||^) if < i (29) 



(2^ 



for large enough. So putting (28) and (29) together, and summing over all 



n = 1, . . . , N leads to the oracle inequality (21 ) 



To show (29) and complete the proof, note that 



p„((A, Z/, s),CXr 



mnWi-Q+T.EHY^'^ym-ii- ^^^^ 



+2EE{(1 

q=l 



q=l II ^n\\2 

^2 y 



-rr-m\\Yj,>x)} 



lY W'^ 

I ^ n\\2 



V II 1^+2 
J- n II 2 



1 - 



lY 11*^ 

I n 1 1 2 



< \\cxJl + 2{iys + Q)P(||Y„||2 > A) =: \\aJl + '^^^ 



)1(I|Y„||2>A)} 
g{fi; A) (30) 



with gif,;X) = 2Q{1 - exp(-pV2) ^^=0 since ||Y„||^ is non- 

central chi-square with Q degrees of freedom and noncentrality parameter /i^ = 
llcKnIII < 1. Considering even Q's for simplicity, Taylor's expansion gives g{fi; A) < 
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^(0; A) + pLg'{0; A) + /i72 sup^g[o,i) W'i^'. A) | . First 

(3/2-1 /T2 /oAj 

= 2Qexp(-A^„^<,/2) ^ 

< 2Qexp(-A^,,/2) g ^^g^ 

_ 2Q r(Q/2) /3^-^(A|y2)^ 
" Ar(log7V)Q/2^^ + W^+ r(i + l)^' 

where the inequahty stems from the fact that A^q > A^q for all Q > 2 and all 
N>No = exp(r(g/2)V(Q/2-i)), ^^^^ ^2^^^ _ ^2^^^ as ^ 00. Then 



9{0;Xn,q) < ^m + X%^Q + 2 ^ e{j,Q,N)] 



i=2 

< f [2 + A^,Q + 2(g/2 - 2)(1)] < |[g + XI^q], 

since ^(Q/^) < 1 and e(i O N) - (1 + Q/2i°giogiv-iogr(Q/2) y,- r(Q/2) ^ . 

for TV large enough. Second, note that 

,'(.;A) = 2Q.exp(-.V2)exp(-AV2)| ^'^[^ vuIq^+I) ^ 

SO ^'(0; A) = 0. Finally 

/feA) ^ 20 exp(-AV2) exp(-.V2) d - x^) E f ^j^^^^ ^ ^ 

+2gexp(-AV2) exp(-xV2)x2 V \^ 

^ ^ ^ ^ ^ ^ jr^or(j + i)r(j + g/2 + 2) 

< 2Q + 2Qx''S{^-l)<2Q + 2X'' 

with 5 = exp(-AV2) exp(-a;V2) EJLo YlMmTojSS) - ^- inequal- 
ity holds for —g"{x;X). Consequently for N larger than A^o, we have A) < 
Q/N{Q + X%^q) + //72(2g + 2X%^q) for // = ||a„||2 < 1. 
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